Diacritics correction in Turkish with context-aware sequence to sequence modeling

Authors

Abstract

Digital texts in many languages contain missing or misused diacritics, which makes it hard for natural language processing applications to disambiguate the meaning of words. Diacritics restoration is therefore a crucial step for these languages. In this study, we approach the problem as a bidirectional transformation between diacritical letters and their ASCII counterparts, rather than as unidirectional diacritic restoration. We propose a context-aware character-level sequence-to-sequence model for this transformation. The model is language independent in the sense that no language-specific feature extraction is necessary other than the utilization of word embeddings, and it is directly applicable to other languages. We trained the model for the Turkish diacritics correction task and, for assessment, used a tweets benchmark dataset. The best setting of the proposed model improves the state-of-the-art results in terms of F1 score by 4.7% on ambiguous words and by 1.24% over all cases.
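To make the idea of a bidirectional transformation between diacritical letters and their ASCII counterparts concrete, the following is a minimal Python sketch. It is not the sequence-to-sequence model described in the abstract: it only enumerates the candidate spellings that such a model would have to choose between using context. The character table and the function names `asciify` and `candidates` are assumptions made for illustration.

```python
from itertools import product

# Assumed pairing of Turkish diacritical letters with ASCII counterparts
# (illustrative table, not taken from the paper).
TO_ASCII = {
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "Ç": "C", "Ğ": "G", "İ": "I", "Ö": "O", "Ş": "S", "Ü": "U",
}
FROM_ASCII = {ascii_ch: dia_ch for dia_ch, ascii_ch in TO_ASCII.items()}


def asciify(text):
    """Replace every diacritical letter with its ASCII counterpart."""
    return "".join(TO_ASCII.get(ch, ch) for ch in text)


def candidates(word):
    """Enumerate every spelling reachable by toggling each letter
    between its diacritical and ASCII form (both directions)."""
    options = []
    for ch in word:
        partner = TO_ASCII.get(ch) or FROM_ASCII.get(ch)
        options.append((ch, partner) if partner else (ch,))
    return {"".join(combo) for combo in product(*options)}


if __name__ == "__main__":
    print(asciify("açı"))
    print(sorted(candidates("aci")))
```

Because toggling works in both directions, a wrongly diacritized input such as "açi" yields the same candidate set as the plain ASCII "aci". Both "acı" and "açı" are in that set, which is the kind of ambiguity a context-aware model has to resolve.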

Similar resources

Effects of diacritics on Turkish information retrieval

We investigate the effects of improper use of diacritics in the Turkish alphabet on information retrieval. A diacritic is simply a supplementary sign added to a letter to change the sound value of the letter, and the Turkish alphabet has 5 special letters derived from Latin by adding different diacritics. The statistical analysis performed in this study shows that retrieval performance signific...

CAPS: Context Aware Personalized POI Sequence Recommender System

The revolution of World Wide Web (WWW) and smart-phone technologies have been the key-factor behind remarkable success of social networks. With the ease of availability of check-in data, the location-based social networks (LBSN) (e.g., Facebook, etc.) have been heavily explored in the past decade for Point-of-Interest (POI) recommendation. Though many POI recommenders have been defined, most of...

Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction

We demonstrate that an attention-based encoder-decoder model can be used for sentence-level grammatical error identification for the Automated Evaluation of Scientific Writing (AESW) Shared Task 2016. The attention-based encoder-decoder models can be used for the generation of corrections, in addition to error identification, which is of interest for certain end-user applications. We show that ...

Sequence to Sequence Modeling for User Simulation in Dialog Systems

User simulators are a principal offline method for training and evaluating human-computer dialog systems. In this paper, we examine simple sequence-to-sequence neural network architectures for training end-to-end, natural language to natural language, user simulators, using only raw logs of previous interactions without any additional human labelling. We compare the neural network-based simulat...

Sequence-Aware Recommender Systems

Characterization. Adopting the formalisms of [3], we can describe the problem at a more formal, abstract level as follows. Let C be a set of users and I a set of recommendable items. In contrast to matrix-completion problems, we are not interested in predicting a utility value for each i ∈ I and for each c ∈ C , but in computing an ordered list of objects L of length k for each user, where each...

Journal

Journal title: Turkish Journal of Electrical Engineering and Computer Sciences

Year: 2022

ISSN: 1300-0632, 1303-6203

DOI: https://doi.org/10.55730/1300-0632.3948